Text Clustering with Random Indexing
Abstract
This project explored how the language technology method Random Indexing can be used for clustering texts from Swedish newspapers. The resulting Random Index based representation yields results similar to those of an ordinary representation when the number of clusters matches the number of real categories. With an increased number of clusters, the Random Index based representation yields better results than the regular representation. Random Indexing is a scalable and computationally efficient method that employs random projections so that all encountered contexts of the words can be compared. The downside of the method is that a certain level of random noise is added to the information content of each word. Unfortunately, these small random disturbances become a real concern when a number of semantically related words are combined. A number of methods were examined to make a Random Index based representation perform better. Weighting the context of a word with the Inverse Document Frequency and normalizing the resulting vector were found to be the most effective ways to preserve the information content for clustering purposes. The project also included experiments on removal of repeated words, filtering of words based on word frequency, and the use of dampening in Random Indexing's weighting schemes. Finally, the project examined how to detect programming errors that prevent the index from functioning properly. Calculating the variance of how the word representations are distributed in the Random Index was shown to be a possible way to find at least one type of severe error.
Textklustering med Random Indexing
Summary (translated from the Swedish "Sammanfattning")
This project investigates how the language technology method Random Indexing can be used to cluster texts from Swedish daily newspapers. The resulting representation gives results equivalent to those of traditionally used representations when the number of clusters matches the number of categories. With an increasing number of clusters, the Random Index based variant gives better results. Random Indexing is a scalable and efficient method that uses random projections to make it possible to compare how similarly words are used in different texts. The drawback of the method is that a certain amount of noise is introduced into the representation. Unfortunately, these small added random deviations are not negligible when a number of semantically related words are combined. A number of methods were examined to make the Random Index based representation perform better. Weighting the words that occur near a word by the Inverse Document Frequency, and normalizing the resulting vector, proved to be the most effective ways to preserve the information needed for text clustering. The project also examined filtering of repeated words, filtering based on word frequency, and the use of dampening in Random Indexing's weighting scheme. Finally, the project examined how programming errors that prevent the Random Indexing representation from working can be detected. Computing the variance of how the word representations are distributed proved to be a possible way to detect at least one kind of serious error.
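As an illustration of the ideas in the abstract, the following is a minimal sketch of Random Indexing with Inverse Document Frequency weighting of the context and normalization of the resulting vectors. It is not the project's implementation: the function names, the dimensionality, the number of nonzero entries, the window size, and the choice of summing word vectors into a document vector are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def index_vector(dim, nonzeros, rng):
    """Sparse ternary random vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0.0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzeros)):
        vec[pos] = 1.0 if i % 2 == 0 else -1.0
    return vec


def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_context_vectors(docs, dim=512, nonzeros=8, window=2, seed=0):
    """Build one normalized context vector per word (all defaults are assumed).

    Each word gets a fixed random index vector. A word's context vector is
    the IDF-weighted sum of the index vectors of its neighbours in a window.
    """
    rng = random.Random(seed)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log(len(docs) / df) for w, df in doc_freq.items()}
    index = {w: index_vector(dim, nonzeros, rng) for w in sorted(doc_freq)}
    context = {w: [0.0] * dim for w in doc_freq}
    for doc in docs:
        for i, word in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                weight = idf[doc[j]]  # IDF weighting of the context word
                for k, x in enumerate(index[doc[j]]):
                    if x:
                        context[word][k] += weight * x
    return {w: normalize(v) for w, v in context.items()}


def document_vector(doc, context):
    """Represent a text as the normalized sum of its words' context vectors."""
    dim = len(next(iter(context.values())))
    total = [0.0] * dim
    for w in doc:
        for k, x in enumerate(context.get(w, ())):
            total[k] += x
    return normalize(total)


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))
```

Document vectors built this way can be passed to any standard clustering algorithm and compared with `cosine`. Because every index vector is random, two unrelated words still receive a small nonzero similarity, which is the random noise the abstract refers to.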
Similar resources
Discovering Word Senses from Text Using Random Indexing
Random Indexing is a novel technique for dimensionality reduction while creating Word Space model from a given text. This paper explores the possible application of Random Indexing in discovering word senses from the text. The words appearing in the text are plotted onto a multi-dimensional Word Space using Random Indexing. The geometric distance between words is used as an indicative of their ...
Comparing and Combining Dimension Reduction Techniques for Efficient Text Clustering
A great challenge of text mining arises from the increasingly large text datasets and the high dimensionality associated with natural language. In this research, a systematic study is conducted of six Dimension Reduction Techniques (DRT) in the context of the text clustering problem using three standard benchmark datasets. The methods considered include three feature transformation techniques, I...
Comparing Dimension Reduction Techniques for Document Clustering
In this research, a systematic study is conducted of four dimension reduction techniques for the text clustering problem, using five benchmark data sets. Of the four methods -Independent Component Analysis (ICA), Latent Semantic Indexing (LSI), Document Frequency (DF) and Random Projection (RP) -ICA and LSI are clearly superior when the k-means clustering algorithm is applied, irrespective of t...
A Comparing between the impacts of text based indexing and folksonomy on ranking of images search via Google search engine
Background and Aim: The purpose of this study was to compare the impact of text based indexing and folksonomy in image retrieval via Google search engine. Methods: This study used experimental method. The sample is 30 images extracted from the book “Gray anatomy”. The research was carried out in 4 stages; in the first stage, images were uploaded to an “Instagram” account so the images are tagge...
Text Summarization using Random Indexing and PageRank
We present results from evaluations of an automatic text summarization technique that uses a combination of Random Indexing and PageRank. In our experiments we use two types of texts: news paper texts and government texts. Our results show that text type as well as other aspects of texts of the same type influence the performance. Combining PageRank and Random Indexing provides the best results...
A Random Indexing Approach for Web User Clustering and Web Prefetching
In this paper we present a novel technique to capture Web users’ behaviour based on their interest-oriented actions. In our approach we utilise the vector space model Random Indexing to identify the latent factors or hidden relationships among Web users’ navigational behaviour. Random Indexing is an incremental vector space technique that allows for continuous Web usage mining. User requests ar...